We’ll create variables using different R data types:
The c() function is used to combine multiple values into a single vector.
Purpose: To store quantitative values that can be used for arithmetic operations. This line creates a numeric vector named age. It contains three values: 25, 30, and 22, which represent ages.
age <- c(25, 30, 22)
The values “Alice”, “Bob”, and “Carol” are strings (text)
names <- c("Alice", "Bob", "Carol")
This creates a logical (boolean) vector. The values represent whether each person is a student or not. TRUE and FALSE are R’s built-in logical constants.
is_student <- c(TRUE, FALSE, TRUE)
Factors are ideal for data analysis involving groups or categories (e.g., grades, gender, regions). This creates a factor vector, which is used for categorical data. R treats factors as enumerated types, which can be ordered or unordered. You’ve explicitly set the levels (categories) from A to F, even though only A and B are used here.
grades <- factor(c("A", "B", "A"), levels = c("A", "B", "C", "D", "F"))
list_types <- list(
Age = typeof(age),
Names = typeof(names),
StudentStatus = typeof(is_student),
Grades = typeof(grades)
)
list_types
## $Age
## [1] "double"
##
## $Names
## [1] "character"
##
## $StudentStatus
## [1] "logical"
##
## $Grades
## [1] "integer"
students_df <- data.frame(
Name = names,
Age = age,
Student = is_student,
Grade = grades,
stringsAsFactors = FALSE
)
knitr::kable(students_df, caption = "Student Information Data Frame")
| Name | Age | Student | Grade |
|---|---|---|---|
| Alice | 25 | TRUE | A |
| Bob | 30 | FALSE | B |
| Carol | 22 | TRUE | A |
info_list <- list(
Names = names,
Ages = age,
Status = is_student
)
for (item in names(info_list)) {
cat("\n🔹", item, ":\n")
print(info_list[[item]])
}
##
## 🔹 Names :
## [1] "Alice" "Bob" "Carol"
##
## 🔹 Ages :
## [1] 25 30 22
##
## 🔹 Status :
## [1] TRUE FALSE TRUE
students_df$Adult <- students_df$Age >= 25
knitr::kable(students_df, caption = "Updated Data Frame with 'Adult' Column")
| Name | Age | Student | Grade | Adult |
|---|---|---|---|---|
| Alice | 25 | TRUE | A | TRUE |
| Bob | 30 | FALSE | B | TRUE |
| Carol | 22 | TRUE | A | FALSE |
The
irisdataset includes measurements of 150 iris flowers from three species. This report combines visualizations, and statistical methods for an end-to-end analysis.
First, let’s get a feel for the data
In R, there are many ways to load data, depending on the format of your data. Here we will load data from the dataset that come with R using the
library()method
library(datasets)
head(iris)will show you the first 6 rows, showing the column names and a sample of the data.
head(iris)
str()show the type of data frame, number of observations (rows) and the number of the variables (columns). It will also show the data type of each column (e.g., num for numeric, Factor for species).
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
The
name()displays the colunm headings for the dataset
names(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
The
summary()provides the minimum (min), 1st quartile, median, mean, 3rd quartile, and maximum (max) for the numerical columns, and the counts for each column (level of the Species factor).
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
We examine summary statistics to understand the central tendency and dispersion of each numeric variable. Summary of numeric features to understand central tendency and spread. Loads the psych package, which provides more detailed descriptive statistics.
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
library(psych)
describe(iris[, 1:4])
Compare mean feature values across species.
aggregate(. ~ Species, data = iris[, 1:5], mean)
To visualise the data we use the
plot()method.
A scatter plot showing the relationship between Sepal Length and Petal Length, colored by species.
plot(iris$Petal.Length)
The
plot()function has many arguments you can use to customize your graph. Here are some of the most common ones:
plot(iris$Sepal.Length, iris$Petal.Length,
main = "Sepal vs Petal Length",
xlab = "Sepal Length",
ylab = "Petal Length",
col = "blue",
pch = 16
)
To reveal how Petal Length varies with Sepal Length across species. Helpful in spotting clusters or patterns. Sepal Length (X) vs Petal Length (Y). Points are colored by Species (setosa, versicolor, virginica). A legend shows which color corresponds to which species.
plot(iris$Sepal.Length, iris$Petal.Length,
col = iris$Species,
pch = 16,
main = "Sepal vs Petal Length by Species",
xlab = "Sepal Length",
ylab = "Petal Length")
legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 16)
Quickly compare the central tendency and spread of Petal Length by species. A boxplot displays the distribution of Petal Length across the three species, highlighting medians and variability. Our box plot shows the Petal Length by Species
boxplot(Petal.Length ~ Species, data = iris, col = c("lightblue", "lightgreen", "lightpink"))
A histogram helps visualize the frequency distribution of Sepal Width for all flowers. Checks if Sepal Width is normally distributed or skewed, and to see the spread of values. Below is the Histogram of Sepal Width
hist(iris$Sepal.Width, col = "skyblue", border = "white")
This matrix of scatterplots shows pairwise relationships among all four numeric variables, with species color-coded
pairs(iris[1:4], main = "Pairwise Plot", pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])
A
ggplot2()version of the scatter plot offering cleaner visualization and built-in legends.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
geom_point(size = 2) +
labs(title = "Petal vs Sepal Length", x = "Sepal Length", y = "Petal Length") +
theme_minimal()
A density plot compares the distribution of Sepal Width across species using smooth curves.
ggplot(iris, aes(x = Sepal.Width, fill = Species)) +
geom_density(alpha = 0.5) +
labs(title = "Density Plot of Sepal Width") +
theme_minimal()
You can use the
plotly()package:
### install.packages("plotly")
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(iris, x = ~Sepal.Length, y = ~Petal.Length,
type = 'scatter', mode = 'markers',
color = ~Species)
We compute correlations between numeric variables to assess linear relationships.
cor(iris[, 1:4])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
cor_matrix <- cor(iris[, 1:4])
heatmap(cor_matrix, col = colorRampPalette(c("white", "blue"))(20),
main = "Correlation Matrix of Iris Features")
ANOVA tests if the mean Petal Length is significantly different across the three species. We will use the
aov()function
anova_model <- aov(Petal.Length ~ Species, data = iris)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## Species 2 437.1 218.55 1180 <2e-16 ***
## Residuals 147 27.2 0.19
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After an ANOVA test tells us that at least one group mean is different, it does not tell us which groups differ. This is where Tukey’s Honest Significant Difference (HSD) test
TukeyHSD()comes in.
anova_model <- aov(Petal.Length ~ Species, data = iris)
TukeyHSD(anova_model)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Petal.Length ~ Species, data = iris)
##
## $Species
## diff lwr upr p adj
## versicolor-setosa 2.798 2.59422 3.00178 0
## virginica-setosa 4.090 3.88622 4.29378 0
## virginica-versicolor 1.292 1.08822 1.49578 0
We fit a linear model using the
lm()to predict Petal Length from Sepal Length and evaluate its explanatory power
lm_model <- lm(Petal.Length ~ Sepal.Length, data = iris)
summary(lm_model)
##
## Call:
## lm(formula = Petal.Length ~ Sepal.Length, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.47747 -0.59072 -0.00668 0.60484 2.49512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.10144 0.50666 -14.02 <2e-16 ***
## Sepal.Length 1.85843 0.08586 21.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8678 on 148 degrees of freedom
## Multiple R-squared: 0.76, Adjusted R-squared: 0.7583
## F-statistic: 468.6 on 1 and 148 DF, p-value: < 2.2e-16
plot(lm_model)
PCA function
prcomp()reduces dimensionality while preserving as much variance as possible, helping us visualize data structure.
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7084 0.9560 0.38309 0.14393
## Proportion of Variance 0.7296 0.2285 0.03669 0.00518
## Cumulative Proportion 0.7296 0.9581 0.99482 1.00000
The scree plot shows the variance explained by each principal component.
plot(pca, type = "l", main = "Scree Plot of PCA")
A biplot shows both the scores of each sample and the contribution of each variable.
biplot(pca, col = c("gray", "red"))
A clearer visualization of PCA results using the first two principal components, colored by species.
pca_df <- as.data.frame(pca$x)
pca_df$Species <- iris$Species
library(ggplot2)
ggplot(pca_df, aes(x = PC1, y = PC2, color = Species)) +
geom_point(size = 2) +
labs(title = "PCA of Iris Dataset", x = "PC1", y = "PC2") +
theme_minimal()
if (!require("pacman")) install.packages("pacman")
## Loading required package: pacman
pacman::p_load(pacman, rio)
getwd()
## [1] "C:/Users/FINANCE/Documents/Divine/BYU/CSE 310 Applied Programming/Data Analysis Project"
install.packages("rio")
## Warning: package 'rio' is in use and will not be installed
library(rio)
rio_csv <- import("mbb.csv")
head(rio_csv)
rio_txt <- import("mbb.txt")
head(rio_txt)
rio_xlsx <- import("mbb.xlsx")
head(rio_xlsx)
This line opens the help page for the
View()function in R. It’s not needed in a report/output — it’s for interactive help in the console.
This opens the
rio_csv()dataset in a spreadsheet-style viewer window (only works in interactive RStudio, not when rendering .Rmd).
?View
## starting httpd help server ... done
View(rio_csv)
R’s built-in function for text files (used by rio)
read.table()Reads tabular data from a text file (default separator is any whitespace). header = TRUE: Tells R the first row contains column names.
r_txt1 <- read.table("mbb.txt", header = TRUE)
## Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec,
## : number of items read is not a multiple of the number of columns
sep = "\t"Specifies the tab character as the field separator — required for tab-separated files (.tsv or .txt with tabs).
s for tabs, sep = “,” for commas. R converts missing to “NA”
r_txt2 <- read.table("mbb.txt",
header = TRUE,
sep = "\t")
R’s built-in function for csv files (also used by rio)
CSV files don’t have to specify delimiters for missing data because CSV means “comma separated values” open using
read.csv()fuction
trends.csv <- read.csv("mbb.csv", header = TRUE)
We combined visual and statistical tools to analyze the classic iris dataset.
Petal Length differs significantly by species (ANOVA).
Sepal Length is a strong predictor of Petal Length (Regression).
PCA helps reduce dimensions while revealing structure across species.
This workflow is a powerful template for analyzing other classification datasets.